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INTRODUCTION 

The computer age has finally advanced to a stage where multi-discipline optimization 
can be entertained in the aerospace arena. The various disciplines to be optimized together 
within the aerospace field include not only the technical design areas such as aerodynamics, 
structures, propulsion, and controls, but also the whole engineering process itself. As 
various disciplines are interdependent, the assumption is that the optimized result of a 
single discipline must not dictate inputs for another discipline in a sequential fashion 
leading to the design of the final product. Instead, the disciplines must be optimized in 
parallel, and optimization iterations must be carried out across disciplines until satisfactory 
design is obtained. Engineering processes such as manufacturing are to be included for 
optimization with the objective of minimizing the costs and reducing the time associated 
with the production of the end item. This work is focused on aerodynamic optimization, 
however, the selected approach is viable for multi-discipline optimization as well. 

In the field of aerodynamic optimization, there are three distinct technical areas. The 
first area concerns formulation of geometry functions. These functions either sit atop 
the existing design or represent the design itself. Perturbations of these functions allow 
new designs. Wagner functions, polynomials, sine bumps, B-splines are some examples of 
geometry parameterization. It is critical to select a class of functions which will allow the 
formulation of as complete a design space as possible. This area of optimization is not 
addressed in this work. 

After suitable functions have been chosen to represent and optimize the geometry, 
we need to determine the direction in which the design variables should be changed to 
obtain the optimum design. For example, if the designer is trying to minimize an objective 
function such as Drag, the optimization process requires determination of sensitivity of 
drag to the change in design variable, viz. the sensitivity gradient. The present work is 
targeted to efficient evaluation of these sensitivity gradients. 

Finally, the third area of optimization deals with the determination of the size of 
perturbation applied to the design variable so that one reaches the optimum design via. 
the path of steepest descent. A number of well developed tools are already available to 
accomplish this task. 

Traditionally, optimization has been carried out to improve the design of a single 
discipline using 

• Direct methods, 

• Indirect methods, 
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• and Inverse methods. 

In direct optimization techniques, one begins with the objective function, i.e. the quantity 
to be optimized, on a baseline configuration. Sequentially, one design parameter at a 
time is changed while others are kept fixed to determine the direction in which the design 
parameter should be varied to maximize the objective function. Constraints satisfying 
the final design requirements can be applied during the optimization process. This type 
of approach allows control over final design configuration and will be suitable for multi- 
discipline optimization problems. 

In the inverse method approach, a target result, e.g. pressure distribution, is specified 
and the geometry is modified till the target value is attained. This approach, and indirect 
methods, however, do not allow complete control over the final configuration, and as a 
result have lost popularity in the recent past. 

In this research effort, a “direct” optimization method is implemented on the Cray 
C-90 to improve aerodynamic design. It is coupled with an existing implicit Navier-Stokes 
solver, OVERFLOW 1 , to compute flow solutions. The optimization method is chosen 
such that it can accommodate multi-discipline optimization in future computations. In 
this work, however, only single discipline aerodynamic optimization will be included. 

APPROACH 

The approach to carrying out multi-discipline aerospace design studies in the future, 
especially in massively parallel computing environments, comprises of choosing 1) suit- 
able solvers to compute solutions to equations characterizing a discipline, and 2) efficient 
optimization methods. In addition, for aerodynamic optimization problems, 3) smart 
methodologies must be selected to modify the surface shape. 

Solver : 

An implicit code to solve the Navier-Stokes equations, OVERFLOW, is chosen as it has 
already been mapped to a number of parallel environments such as the TMC CM-5 2 , the 
Intel iPSC-860 3 , the Intel Paragon, and a network of workstations 4 , thus offering a number 
of possibilities for mapping the optimization component. A number of options, such as 
Pulliam and Chaussee’s diagonalized scheme 5 , Block Beam- Warming scheme 6 , and Steger’s 
partially flux split algorithm 7 are coded in OVERFLOW. The first two options are from 
the ARC3D code. In the OVERFLOW code, the ARC3D options are written for an overset 
grid framework thus allowing the possibility of design optimization of configurations that 
do not lend themselves easily to single grid topologies. Additionally, a number of related 
disciplines such as dynamics 8,9 , controls 10 , and propulsion 11 have already been coupled 
with this code, permitting multi-discipline optimization studies in the future. 

Optimization Method: 

As outlined in the introduction, the “direct” optimization technique is selected to im- 
prove the design. The crux of the problem in this method is to determine the sign of the 
sensitivity of the objective function to the variation of a design variable. The conventional 
“brute-force” approach can be extremely expensive for design problems with a large num- 
ber of design parameters requiring as many or more complete solutions at each optimization 
step as the number of design variables. Recently, however, substantial progress has been 
made in determination of sensitivity of the objective function to the design variables using 
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both the continuum and discrete approaches. 12-17 In these new approaches, the direction 
of design variable change is determined from analytical/quasi-analytical expressions as 
opposed to the complete solutions of governing equations. Solution of an algebraic set of 
equations to determine the complete set of sensitivity derivatives is required for a given op- 
timization step. Consequently, for problems where the number of design variables is large, 
as in realistic wing aerodynamic optimization problems, and multi-discipline optimization 
problems, this will prove to be an efficient approach. In the present research effort, the 
discrete quasi-analytical approach to computing the gradient of objective function with 
respect to variations in design variables, is being coupled with the flow equations. 

Only the Euler subset of the complete flow equations in 2-D is being considered at 
this time. The Euler equations in generalized curvilinear coordinates are given by: 

d r Q + d^E + d„F = 0 (1). 

In Eq. (1), r is time, £ and r/ are the curvilinear coordinates, Q is the vector of conservative 
variables, and E and F, are the inviscid flux vectors. 

In vector form 
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More details of these terms can be found elsewhere. 5 To obtain an implicit solution to Eq. 
(1), E and F are linearized in time resulting in 4 x 4 flux jacobian matrices A = and 

B = and the following equation: 


I + hA ^ + hBjf A Q — — -f- 


( 2 ) 


In ARC3D’s 5 approach to solving the above equation, E £, F v , A^, and B v are central 
differenced. A combination of 2nd and/or 4th order smoothing is added to both sides of the 
equation to make the numerical scheme stable. The left hand side is approximated by either 
1) factorization into two directionally independent matrices with 2nd order smoothing 
added to each direction yielding block tri-diagonal matrix systems, or, 2) diagonalization 
of the factorized matrices with combination 2nd/4th order smoothing yielding scalar penta- 
diagonal matrix systems. At steady state, A Q vanishes, and we are left with: 


in the interior, and 


R = d^E + d v F - k 


d*Q 



= 0 


(3a) 


R = Qb- f(Q 0 ,X,p) = 0 (36) 

at the boundaries, where k is constant and (3 is a vector of design variables. Q b and 
Q 0 are values of Q at the boundaries and the interior respectively. X is the vector of 
physical co-ordinates of the grid. Compared to the ARC3D implementation, here, 2nd 
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order smoothing and spectral radius scaling have been neglected thus resulting in only the 
fourth order smoothing terms in Eq. (3a). 

For the optimization problem, in general, the real goal is to find the sensitivity of 
the objective function to the variation of design variables. The objective function may be 
the value of vector Q , or a function there of, at a certain location in the flow field, or a 
quantity that is obtained by integration of Q on a boundary, e.g. Cl- To determine the 
sensitivity of the Euler equations to the variation in design variables, R in general may be 
expressed as: 


R(Q(X,D), X(p), D) = 0 

Similarly, the jth objective function Cj can expressed as: 

Cj = Cj(Q(0), X(P), 0) 


( 4 ) 


( 5 ) 


The goal, however, is to find sensitivity of the jth objective function Cj to the design 
variables 0k. Differentiating the above eq. yields: 
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( 6 ) 


Eq. (6) comprises of five terms. Evaluation of the first and third terms, viz. the single 

depends upon the selection of the objective function Cj. 


row matrices 


dCj 

dQ 


and 


dC; 


dX 


For example, let the objective function be expressed as 

Cj = Y, 

Then, p, the pressure can be easily represented in terms of components of vector Q which 
can then easily be differentiated w.r.t. Q to determine • Similarly, £ x can be expressed 

in terms of spatial coordinates so that Cj can be differentiated w.r.t. X to determine 
[^*"] • The f° ur th term, is the grid sensitivity term. For problems, where design 

parameters are either not geometric, or are such that an existing grid simply needs to be 
translated/rotated, the grid sensitivity term is easy to compute. However, for problems, 
where the design parameters are such that new surfaces have to be defined, this term is 
evaluated by a brute finite-difference approach. Assumption is made that selected objective 
functions will not have an explicit dependence on 0k thus negating the need to evaluate 

the 5th term i.e. 

To evaluate the fourth term, | j in eq.(6), eq.(4) is differentiated with respect to 
0k, the kth design variable, following the procedure outlined in Refs. 14-16, to yield: 
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The assumption that chosen design variables are such that R does not depend upon /?* 
leads to the following system of equations that needs to be solved to determine j j : 


dR 

dQ 


dQ 

dp k 


dR 

OX 


dX 

dp k 


(8) 


In the interior, 


dR 


comprises of A B v and Frechet derivatives of smoothing terms 
At the boundaries \¥£\ is evaluated using Frechet derivatives. For 


dQ 

with respect to Q . At the boundaries ' 

example, Frechet derivative of a fixed inflow boundary would result in contribution of 
identity matrices to the main diagonal for rows corresponding to the boundary. Similarly, 
an extrapolation boundary condition would result in contribution of identity matrices to 
the main diagonal for rows corresponding to the boundary. In addition the same rows 
would also get contribution of negative identity matrices at locations where solution is 
extrapolated from. The second order two-point central differenced stencil of the flux ja- 
cobians and the five-point central differenced stencil of the smoothing term result in a 
(4 x jmax x kmax) x (4 x jmax x kmax ) matrix system of nine diagonals. 


Evaluation of 


dR 

dX 


is made simple by writing the flux jacobians E and F in the 


transformed curvilinear co-ordinates in terms of the flux jacobians in the original physical 
co-ordinates and then using Frechet derivatives. The resulting matrix for ^ is (4 x 

jmax x kmax ) x (2 x jmax x kmax ) with four diagonals where each diagonal entity is a 
block 4x2 matrix. 

In 2-D, there are advantages to solving eq. (8) directly, however, in 3-D, only iterative 
approaches will be promising. Delta formulation of this equation, where approximations 
similar to those made for the flow solver 5 can be used, is being looked into. Till date, in 
the current work, eq. (8) is solved using Cray library routines SSGETRF and SSGETRS, 
where the first routine factorizes the matrix and the second routine solves it using pivoted 
Guassian elimination technique. Only non-zero entries are stored to solve the above system. 
It should be pointed out that the system may be solved explicitly as well such that an 
inverse does not need to be computed leading to substantially less storage requirements. 

j j thus f° un( i is substituted back in eq. (6) to yield the sensitivity gradient. 

The sensitivity gradient can also be found by formulating the adjoint equation. Here, 
we add A R to eq. (6), where A is the Lagrange multiplier. Then A is to be determined 
such that 


dCj^ . j'dJR 
3 dQ 


dQ 

d(i k 


= 0 


thus negating the need to determine j Eq. 9 expressed as: 


dR 

dQ 


— 


dCj 

dQ 


lT 


( 9 ) 


( 10 ) 


5 



is the adjoint equation and has to be solved for A. Components required to be evaluated for 
this equation are same as for the direct method before. Selection of which method should 
be used to solve a particular problem depends upon the number of constraints versus the 
number of design variables. Once A is known, the sensitivity derivative is evaluated by 
computing: 

d£i = \?£i. x i-aftl/a*! =n 

dp k [dX 3 dX\\dPk) 

RESULTS AND STATUS 

The derivation of the adjoint and sensitivity equations, and implementation in OVER- 
FLOW is complete. The approach to testing this optimization implementation is based 
on the fact that the direct “brute-force” optimization method works for practical design 
problems but is limited because of the expense incurred when the number of design vari- 
ables is large. 19 If a faster method of determining sensitivity of the objective function to 
the design variables can be implemented, then the existing optimization procedures can be 
followed by simply replacing the evaluation of the sensitivity derivative with the current 
approach. 

In an effort to test the sensitivity approach, a simplified problem of design of a quasi- 1- 
d convergent-divergent wind-tunnel for a given throat velocity was devised. For the Euler 
equations, in the absence of any vorticity generation, this reduces to the solution of a very 
simple equation, viz. V A = constant , where V is velocity and A is cross-section area. 
NPSOL 18 , a nonlinear optimizer, based on a sequential quadratic programming algorithm, 
was used to drive the optimization procedure. With velocity as the objective function and 
area as the design variable, NPSOL could be used to compute sensitivity of velocity to 
area change by evaluating V A = constant , viz. the “brute-force” method. Also, one could 
supply the sensitivity to NPSOL by differentiating V A = constant . Both approaches were 
tried in this rather simplistic example to yield the same solution and to show the benefit 
of supplying the analytical expression for the gradient. 

In the next step, a shock-free 2-d convergent- divergent wind-tunnel example is chosen 
to test the accuracy and efficiency of the current approach. Thrust at exit is chosen as the 

objective function. Consequently, corresponding to Eq. (3b), and are derived 

for inflow, outflow, symmetry, and tangency boundaries using Frechet derivatives. A two- 
dimensional rectangular grid was generated and flow was computed with the assumption 
of symmetry on the top surface as only the bottom half of the wind-tunnel is being solved. 
The lower boundary of the grid was allowed to move by placing a sine-bump 20 at the throat 
to perturb the baseline rectangular design. Sensitivity gradients computed using the direct 
and the adjoint approaches match identically, however, vary slightly when compared with 
the “brute- force” method. To resolve the differences, the left- and right-hand sides of eq.(8) 
were computed both numerically and analytically. The numerical evaluation of the right- 
hand side of eq. (8), viz. the vector obtained by carrying out matrix vector multiplication 

«<[#]{ 4 ^- 1 is done as follows. Compute R\ to convergence on a grid corresponding 
to fi\. Compute R 2 by carrying out a restart solution on a new grid corresponding to 
P 2 (perturbed value of design variable Pi) for one step. (R 2 — -Ri )/(/?2 — Pi) then is the 
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numerical value of the right-hand side of eq. (8). Similarly, the numerical evaluation of the 
left-hand side of eq. (8) proceeds by computing Q\ to convergence on a grid corresponding 
to /3i and Q 2 to convergence on a grid corresponding to /?2* R\ IS determined as before. 
However, R 2 is now determined by supplying Q 2 as the restart and running the solution 
for one time step. (R 2 — R \)/{^2 — fix) then is the numerical value of the left-hand side of 
eq. (8). Figures 1 and 2 show the excellent comparisons obtained between the numerical 
and analytical evaluations of left- and right-hand sides of eq. (8). In addition notice that 
solution of Fig. 2 is indeed negative of solution of Fig. 1. 

Once, the accuracy of the sensitivity gradients is ascertained, the existing frame- work 
used by the “brute-force” practitioners can be used as before to carry out the design pro- 
cess. Now, however, when a program such as NPSOL requires evaluation of sensitivity 
gradient to determine the optimum design, rather than computing flow solution to con- 
vergence to determine sensitivity gradient, one calls the new routine to determine the 
same. 

The work carried out so far in this project indicates that this approach is much 
more efficient compared to the “brute-force” method for 2-D problems. Based on the 
flowsolver experience, it seems probable that iterative techniques similar to those used in 
the flowsolver can be used in 3-D for the optimization problem as well. However, work 
needs to be carried out in that area. 

Compared to the Analytical representation of the adjoint equation as opposed to the 
discrete representation as used here, following points are noted: 

1) Analytical representation may be more accurate by definition. 

2) Implementation of boundary conditions, especially where discontinuous conditions 
exist, may be extremely hard for the analytical approach. Discrete representation on the 

other hand allows for evaluation of at boundaries using Frechet derivatives of discrete 

boundary conditions in a straight-forward manner. 

3) Either the analytical or the discrete approach can be carried out using implicit or 
explicit techniques. In the aerodynamic optimization literature, analytical approach used 
by Jameson et. al. (13) has been carried out by using explicit techniques where as the 
discrete approach used by others has always been carried out using implicit techniques. 
With in the discrete area, advantages of the explicit approach need to be explored. Note 
that when an implicit flowsolver has been used in conjunction with the adjoint problem, 

one has the advantage of having already computed some of the terms as required by 

the time linearizations of the governing equations. 

A number of other issues need to be explored to understand the pros and cons of the 
new approaches. For example, is accurate evaluation of sensitivity gradient necessary at 
each optimization iteration step? Or, could one reach the final design by carrying out semi- 
converged solutions of optimization iterations? Is the final design path dependent in that 
case? For the discrete implementation of the adjoint method, does the adjoint equation 
need to be based exactly on the governing equation, or could some assumptions be made 
to simplify, for example, the smoothing terms? It is hoped that other researchers in the 
near future will explore these issues and provide answers so that a concrete assessment can 
be made of the viability and pros and cons of the various approaches. Finally, feasibility of 
using ADIFOR (21) should be checked for computing derivative of the complete smoothing 
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terms with respect to Q. 
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Abstract 

Analysis of modern aerospace vehicles requires the 
computation of viscous flowfields about complex 3-D 
geometries composed of regions with varying spatial 
resolution requirements. Overset grid methods allow 
the use of proven structured grid flow solvers to ad- 
dress the twin issues of geometrical complexity and 
the spatial resolution variation by decomposing the 
complex physical domain into a collection of overlap- 
ping subdomains. This flexibility is accompanied by 
the need for irregular intergrid boundary data com- 
munication among the overlapping component grids. 
This study investigates one of the strategies for imple- 
menting such a static overset grid implicit flow solver 
on distributed memory, MIMD computers; i.e., the 
160 node IBM SP2 and the 208 node Intel Paragon. 
Performance data for two composite grid configura- 
tions characteristic of those encountered in present 
day aerodynamic analysis are also presented. 

Introduction 

The complex! t}^ of Computational Fluid Dynam- 
ics (CFD) simulations attempted at present is very 
closely related to the sustained CPU performance of 
the readily available computer resources. Simplified, 
2-D flow analysis simulations can be carried out using 
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conventional high performance workstations on a reg- 
ular basis. However, 3-D unsteady, viscous flow analy- 
sis still requires the very best of computing hardware. 
Most of the current generation of vector supercom- 
puters such as the Cray C-90 and the NEC SX-3 are 
fully capable of providing the compute resources re- 
quired for such simulations. However the high cost of 
such machines and their consequent limited availabil- 
ity have spurred efforts aimed at seeking more cost ef- 
fective approaches to high performance, numerically- 
intensive computing. Most of the ongoing efforts in 
this area are carried out under the umbrella of the 
national High Performance Computing and Commu- 
nications Program (HPCCP). One such approach is 
based on the exploitation of the relatively high degree 
of concurrenc}' and the spatial data locality inherent 
in numerical algorithms used for aerodynamic simu- 
lations. Under these conditions, distributed memory, 
multiple instruction, multiple data (DM-MIMD) com- 
puters offer excellent long-term prospects for greatly 
increased computational speed and memory at a cost 
that may render the 3-D flow analysis around com- 
plex geometries on a routine basis more affordable. 
Among the recent advances in computer hardware 
technologies that lend credence to such expectations 
are the advent of mass-produced high-performance 
Reduced Instruction Set Computing (RISC) micro- 
processor chips, high density Dynamic Random Ac- 
cess Memory (DRAM) chips and high-speed intercon- 
nect networks that are easily scalable to the level of 
hundreds of nodes. The essential remaining ingredi- 
ent required for the success of this mode of comput- 
ing is the development and implementation of under- 
lying numerical algorithms in a manner that is con- 
ducive to retaining high parallel efficiencies when the 
number of processors used range at least in the low 
hundreds. This often requires a complete top down 
anal} r sis of the entire numerical scheme in search of 
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exploitable concurrency associated with various al- 
gorithmic phases and a complete understanding of 
the essential data dependencies. This is followed by 
the design of a parallel implementation strategy that 
is capable of achieving a near optimal load distri- 
bution among all participating computational nodes 
and simultaneously attempts to minimize the inter- 
processor communication costs. Such considerations 
usually require changes in one or more of the follow- 
ing: a) the scheduling of various tasks associated with 
the underlying numerical scheme, b) the manner in 
which the associated data is organized and c) the algo- 
rithms used to perform certain subtasks. This some- 
times leads to radical re-engineering of the existing 
serial implementations. A further complicating factor 
in this endeavor is the lack of advanced software devel- 
opment tools for the current generation of DM-MIMD 
computers comparable to those found on vector super- 
computers to aid in the program development effort. 

An inescapable fact when computing flowfields 
around modern aerospace vehicles is the associated 
geometrical complexity. This is further compounded 
by the presence of regions with widely varying resolu- 
tion requirements surrounding such vehicles. A vigor- 
ous research effort is currently underway in the CFD 
community to develop solution adaptive, unstructured 
grid flow solvers to deal with such geometrical and 
physical complications. However, their suitability for 
high Reynolds number flow simulations over compli- 
cated geometries is yet to be firmly established. On 
the other hand, the use of well proven structured grid 
flow solvers in combination with the overset grid ap- 
proach 2 has proven to be a viable alternative to 
the fully unstructured grid approach for simulating 
high Reynolds number flows around complex configu- 
rations. 

In the overset grid approach 3 , the complex air- 
craft configuration is first decomposed into a set of 
components, each with a relatively simple geometry. 
This is followed by the independent meshing of each 
such component using logically structured curvilinear 
meshes. To ensure adequate spatial resolution of the 
flow field, additional overset grids can be used in crit- 
ical regions based on a-priori knowledge of the flow 
field. Finally, these component grids are overlaid to 
yield a larger composite grid that can be used to com- 
pute flow fields around complex configurations. Such 
an approach gives rise to a locally structured but glob- 
ally unstructured flow solver. 

Overlaying of grids in this manner results in em- 
bedding of outer boundaries as well as the solid body 
regions of one grid within the computational domains 
of the other grids. As a result, the grid points belong- 
ing to the latter grids that lie within an embedded 
solid body region along with some prescribed neigh- 
borhood around it are 'blanked-out’, i.e., excluded 
from the computation. Such points are commonly re- 


ferred to as hole points and the grid points that lie 
in the fringes of these blanked-out regions form arti- 
ficial interior boundaries. They are used to impose 
the influence of the embedded solid body upon the 
surrounding component grid. The union of the outer 
boundaries of the embedded minor grids and the artifi- 
cial interior boundaries delineated by the blanked-out 
regions form the inter-grid boundaries of the compos- 
ite grid system. The values of flow fieid variables at 
grid points lying on these inter-grid boundaries need 
to be obtained through interpolation from the solu- 
tions of the other component grids in which they are 
embedded in. 

The interpolation process required to compute val- 
ues of flow field variables at grid points lying on the 
inter-grid boundaries serves to communicate the in- 
fluence of the solution on one grid to those on the 
other grids. In practice, this intergrid boundary point 
(IGBP) data interpolation process is carried out at 
the beginning of each time step of the time march- 
ing scheme adapted for the flow solvers used within 
each component grid and is referred to as the inter- 
grid communication. The intergrid communication 
scheme seeks the necessary interpolation data from 
the hexahedral computational cell of the donor grid 
that contains the IGBP in question and such cells are 
referred to as the donor cells. Therefore the overset 
grid approach requires the identification of the fol- 
lowing entities in all the component grids: a) hole 
points, b) IGBP’s, c) donor grids and donor cells and 
d) tri-linear interpolation coefficients. For the test 
cases considered in this study, we used DCF3D 4 soft- 
ware running on a workstation to accomplish this task 
as a preprocessing step. It should be noted that this 
intergrid communication process can have a highly ir- 
regular structure depending upon the relative posi- 
tioning of the component grids. The distribution of 
the IGBP’s within the computational space of each 
component grid is generally very non-uniform. In ad- 
dition the corresponding set of donor cells may be dis- 
tributed among multiple donor grids. Conversely each 
donor grid may be contributing data to IGBPs belong- 
ing to many other component grids. Finally, just as 
in the case of the IGBP’s, the donor cells within a 
component grid can have a highly non-homogeneous 
distribution with respect to its computational space. 

The primary objectives of this study are three fold: 
a) design of a scalable parallel implementation strat- 
egy for the overset grid, implicit flow solvers on DM- 
MIMD computers when the number of processors 
range in the hundreds, b) development of intergrid 
communication data structures and inter-processor 
communication strategies for its implementation on 
the DM-MIMD computers and c) validation of the 
parallel implementation strategy and the assessment 
of its scalability as well as the overall performance 
potential through the use of realistic composite grid 
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configurations. Two DM-MIMD computer testbeds 
were chosen for this validation and evaluation process, 
viz. the 160-node IBM SP2 and the 208 node Intel 
Paragon. Two test problems are selected here for the 
performance evaluation of the overset grid flow solver. 
These problems require the solution of the Navier- 
Stokes equations and the use of multiple overset grid 
topology. The first test problem is the 4-grid simu- 
lation of viscous flow past a delta wing with thrust 
reverser jets, flying in ground effect (the Powered-Lift 
configuration). The second test problem is the sim- 
ulation of viscous flow past the FLAC (Fighter Lift 
and Control) wing with deflected leading and trailing 
edge flaps (the High-Lift configuration). This 20-grid 
setup offers an opportunity to evaluate load balancing 
issues and the grid partitioning strategies for realisti- 
cally complex geometries. 

In the following sections, the concurrent algorithms 
for overset grid problems and their parallel implemen- 
tation strategy is summarized. This is followed by de- 
scriptions of the computational test beds and the ge- 
ometry/boundary conditions of the selected test prob- 
lems. Then we present the results of our experiments 
along with some analysis. 


Solution of Overset Grid Problems 


As a prelude to the development of a parallel im- 
plementation strategy, a brief conceptual overview of 
the generic mathematical algorithms underlying the 
overset grid flow solvers is presented in this section. 
It is assumed that within each component grid, the 
Navier-Stokes equations along with the relevant physi- 
cal/numerical boundary conditions are discretized us- 
ing the appropriate spatial and temporal discretiza- 
tion procedures. This in conjunction with the impo- 
sition of the intergrid interpolation conditions at the 
IGBPs results in a system of nonlinear algebraic equa- 
tions for each component grid that can be represented 
by the following generalized vector functions: 

F.( Q" + 1 ,Q" + 1 ,... ( Qr 1 Qm + 1 ’Q") 

= 0, (f = 1,2 i Af). (1) 

where Q" +1 is the vector of discrete flow field vari- 
ables belonging to the ?' — th component grid at the 
time level (t 4- A t) and M is the total number of 
component grids involved. It should be noted that 
F ,• may not be a function of all Q t -, (: = 1, 2, . . . , M). 
The exact functional dependence is determined by the 
relative positions of the overlapping component grids. 

There are a variety of iterative approaches avail- 
able for the solution of the system of equations given 
by Eq. (1). The implicit flow solvers used in this 
study use a non-iterative time marching scheme for 


its solution. In this approach, the system of equa- 
tions is linearized about the already known solutions 
Q?,(i = 1,2,...,M). Then the resulting global sys- 
tem of linear algebraic equations are given by: 



-Fi(Q?,QJ,...,QX f ) \ 
-Fa(Q?,Q?,...,QXr) 

V-F M (Qr,Q 2 n ,.-.,Q^)/ 


( 2 ) 


where A"y = (dF;/<9Q" +1 )(Q?, Q?, . . . , QJ, ) and 
Q ? +1 = Q" + AQ",(i = 1,2, .. . ,M). The off- 
diagonal block matrix elements A;j, (i ^ j) of the 
global Jacobian matrix represent the intergrid cou- 
pling effects between component grids i and j through 
the interpolated values at IGBPs. These block matrix 
elements are themselves sparse matrices with highly 
irregular structure. Due to the use of tri-linear in- 
terpolation for intergrid communication, they gener- 
ally have a maximum of eight non-zero elements per 
row. Again, some of these off-diagonal block matrix 
elements may be null matrices, depending on the rel- 
ative locations of the component grids. The diagonal 
block matrix entries A- 1 - represent the implicit cou- 
pling of variables within a component grid, similar to 
those found in well known uni-grid flow solvers. The 
correction vector (AQ?,AQ 2 ",...,AQ£ ; ) needed to 
update the flow variables in all component grids is 
given by the solution to this large, sparse system of 
linear equations. There are many approaches avail- 
able for the solution of this system of linear equations 
and the method selected should be capable of provid- 
ing a sufficiently accurate solution with a high degree 
of reliability in addition to being amenable to efficient 
implementation on DM-MIMD architectures. In the 
following paragraphs, we conceptualize some of the 
available algorithm alternatives for the solution of Eq. 
(2), and discuss the advantages and disadvantages as- 
sociated with each such alternative. 

The obvious first choice is the fully-coupied ap- 
proach, where the system of equations (2) is di- 
rectly inverted. While such a direct inversion scheme 
would lead to an unconditionally stable time march- 
ing scheme for the overset grid problem, it would be 
prohibitively expensive in terms of computer resource 
requirements (CPU time and memory), for solving 
problems of practiced interest to the computational 
aeroscience community. In addition, due to the highly 
irregular sparsity structure of the coefficient matrix, 
the direct inversion of Eq. (2) would not lend itself 
to an efficient implementation on DM-MIMD comput- 
ers. An alternative avenue within the context of the 


3 



fully-coupled approach is to seek a solution to Eq. (2) 
through a matrix-free iterative scheme, which is de- 
signed, if feasible, to be significantly more economical 
both in terms of memory and CPU time requirements 
and at the same time be more amenable to efficient 
implementation on DM-MIMD machines. We defer 
the consideration of such a solution scheme to future 
efforts. It should be noted that the use of a geometric 
multigrid approach to solve Eq. (2) has already been 
reported in the literature 5 . 

The alternative to the fully-coupled approach to 
solving of Eq. (2) is the partitioned analysis. In this 
approach, some of the off-diagonal block matrix en- 
tries, which are responsible for the intergrid coupling 
effects are moved to the right hand side of Eq.(2). This 
is effected by evaluating their contributions based on 
the temporally extrapolated approximations to the 
relevant elements in vectors Q- 1+1 ,(i = 1,2,... ,M). 
These predicted values are usually obtained as a suit- 
able linear combination of their values at the previous 
time levels, n, ( n — 1) etc. The primary motivation for 
this approximation is the resulting decoupling across 
the inter-grid boundaries, of the solution of the large 
system of equations represented by Eq. (2). Conse- 
quently, the solution of Eq. (2) is accomplished by 
solving a series of smaller sub-systems of linear equa- 
tions represented by it’s diagonal block matrix entries. 

There are two commonly used variants to the par- 
titioned analysis approach. If the effect of all the 
off-diagonal block matrix entries in Eq. (2) are to 
be approximately represented on its right hand side, 
based entirely on the extrapolated values to the dis- 
crete field variables required during intergrid commu- 
nication, then the system will be solved through an 
approach similar to a block- Jacobi scheme with un- 
equal block sizes. If on the other hand, the matrix 
in Eqn. (2) or some permuted form of it is reduced 
to a block lower or upper triangular matrix by ap- 
proximately representing the effects of only some of 
its off-diagonal block matrix entries on the right hand 
side through extrapolation in time, then the underly- 
ing system is solved by an approach akin to the classi- 
cal block-Gauss-Siedel(GS) method. In this staggered 
approach, the effect of some of the off-diagonal block 
matrix entries are represented on the right hand side 
using the most recently computed discrete field values, 
instead through temporal extrapolation. A majority 
of the currently available uni-processor and shared- 
memory multiprocessor implementations of the over- 
set grid flow solvers falls into this category. A third 
approach, which is a hybrid of the above two ap- 
proaches is also feasible. In this multilevel method, 
clusters or groups of component grids are formed first. 
This is followed by the application of block GS like 
algorithm within the groups and block-Jacobi like al- 
gorithm across the groups. 

A direct consequence of this partitioned analysis ap- 


proach to solving the system of equations (2) is it’s 
conditional stability with respect to the time step size 
At. In order to avoid numerically unstable compu- 
tations, time step size At has to be bounded by a 
value determined by the highest temporal frequency 
component present in the solution of the overset grid 
problem. In addition, the severity of the stability re- 
strictions is also likely to depend on the fraction of the 
IGBPs relative to the total number of grid points and 
the characteristics of the flow field in regions where 
the intergrid boundaries are located. For some over- 
set grid problems, these restrictions are likely to prove 
to be too severe, giving rise to solution schemes that 
are unconditionally unstable for ail practical purposes. 
Therefore it is assumed that for the class of overset 
grid problems of interest to this study; a) the tran- 
sient response is primarily dominated by the relatively 
low frequency components and b) the component grids 
are designed such that the placement of intergrid 
boundaries in critical flow regions are avoided. Con- 
sequently, the partitioned analysis approach is likely 
to prove to be a cost-effective alternative for solving 
the system of equations (2). As in the case of block- 
Jacobi vs. block-GS schemes, the restriction placed 
on the value of At due to numerical stability con- 
siderations is likely to be more severe in the case of 
the first variant of the partitioned analysis approach. 
Such restrictions placed on At may be alleviated to 
some extent through the use of sub-iterations within 
a time step. 

In spite of the above mentioned drawbacks, the par- 
titioned analysis approach can provide several signif- 
icant computational and software engineering advan- 
tages over the fully-coupled approach. Among these 
are; 1) ability to use proven and independently devel- 
oped discretization/solution algorithms within each 
component grid involved, 2) preservation of high de- 
gree of software modularity and 3) excellent prospects 
for realizing scalable parallel implementations on DM- 
MIMD computers. Furthermore, within the context 
of the partitioned analysis approach, incorporation of 
additional coupled disciplines such as controls, ther- 
mal analysis etc. can be accommodated relatively eas- 
ily. 

In this paragraph we examine the algorithmic de- 
tails of the block Jacobi- variant of the partitioned 
analysis approach for overset grid problems. This 
variant is represented by the following system involv- 
ing a block diagonal coefficient matrix: 
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Parallel Implementation Strategy 


A" 

0 

0 


where the right hand side vectors are defined by: 

R" = -F?(Q?,QJ,...,Q?,...,QSr) 

= -F?(Q?, (&..., Q?,...,Q5,) (4) 

and AQ? = Q? - Q", (j = 1, 2, . . . , M). The tempor- 
lly extrapolated values of the discrete flow variables 
required for the intergrid data interpolation process 
are given by formulae of the type: 

(m-l) 

q ; = E «q. r*. 

Jb=0 

and Qjt are appropriately chosen constants. Following 
algorithmic facts are evident from the above analy- 
sis: a) all intergrid data dependencies appear only in 
the right hand side vectors, b) all intergrid data in- 
terpolation and communication requirements can be 
accomplished concurrently and c) all component grid 
sub-problems can be solved concurrently. 

The solution of component grid sub-problems in- 
volves the inversion of block matrices = 

1,2,..., M). Since the component grids are logically 
structured, these matrices are regular sparse, banded 
matrices 6 , with a relatively high bandwidth. Again, 
the direct inversion is an option but not viable due 
to reasons cited before. However, there exist several 
well-proven approximate inversion schemes that have 
been developed over last two decades within the con- 
text of uni-grid computations. Most of these time 
marching schemes are generally characterized by: a) 
low memory requirements and b) amenable to effi- 
cient implementation on DM-MIMD architectures. It 
should be noted that the scheme chosen need not be 
the same for all component grids. The partitioned 
analysis approach offers the possibility of using dif- 
ferent schemes on different grids, if needed. In this 
study we have chosen to use the diagonalized form of 
the block Beam-Warming scheme 6, ' on all component 
grids. 
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In this section, we provide a brief overview of some 
of the DM-MIMD architectural features that influ- 
enced our parallel implementation strategy followed 
by an abbreviated discussion of some salient features 
of the implementation. The DM-MIMD implementa- 
tion of an overset grid flow solver based on explicit 
update of the intergrid boundary values presents sev- 
eral options. This is primarily due to the MIMD 
characteristics of the architecture. Here, we provide 
a discussion of the options available and the factors 
influencing the choices. For the remainder of this dis- 
cussion, we assume a DM-MIMD computer with a 
fixed but moderate to large number of processors and 
an overset grid problem involving a fixed number of 
grids. The system software on the current generation 
of DM-MIMD computers does not allow dynamic pro- 
cessor allocation/de-allocation, once a job is initiated 
on a fixed subset of it’s processors on a space shared 
basis. Furthermore, in this study, we preclude the 
possibility of further subdivision of any of the com- 
ponent grids through the introduction of additional 
inter-grid boundaries. Although such subdivision may 
facilitate computationally more efficient and a scal- 
able implementation on DM-MIMD architectures, it’s 
impact on the solution integrity, numerical stability 
and the overall convergence rate is currently not well 
understood. However, it should be noted that such 
implementations for both multi-block and overset grid 
problems already exist 8, 10 . Also, this approach is a 
subset of the implementation adapted is this study. 
The computational load associated with each grid, 
largely a function of the number of grid points, is 
also assumed to be fixed during the entire simulation. 
However, we allow for the possibility that differences 
may exist across component grids in the following: a) 
time marching scheme used for the advancement of 
the solution and b) the physical effects included in 
the simulation. 

The time marching schemes used within each com- 
ponent grid possess a certain degree of concurrency 
that can be exploited through data parallelism. In the 
block-GS like variant of the overset grid flow solver, 
each component grid is processed in a predetermined 
sequence. Consequently, the degree of exploitable con- 
currency at any given instant is limited to the data 
parallelism available within the component grid being 
processed at that instant. This results in a situation 
where the degree of exploitable concurrency may vary 
as a function of the elapsed time. Such an implemen- 
tation of an overset grid, implicit flow solver on the 
Connection Machine CM-2 has already appeared in 
the literature 11 . 

In contrast, the use of the block- Jacobi like variant 
allows more than one component grid to be processed 
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concurrently. This permits the exploitation of an ad- 
ditional degree of concurrency, available across all or 
some fixed subset of the component grids involved. 
This extra level of parallelism is generally exploited 
through concurrent processing of multiple component 
grids on distinct clusters of processors and is com- 
monly referred to as task parallelism. Such an im- 
plementation allows the simultaneous exploitation of 
the task parallelism available across the component 
grids and the data parallelism available within each 
component grid. However, in order to ensure that a 
sufficiently good static load balance exist across all 
the clusters participating in the solution of the over- 
set grid problem, the following condition need to be 
satisfied. The fraction of the total number of proces- 
sors assigned to each cluster should be be directly pro- 
portional to the fraction of total computational load 
associated with the component grids being processed 
on that cluster. 

In the following discussion, we summarize some 
of the advantages and disadvantages associated with 
each of the approaches. A more detailed description 
can be found in Ref. 12. The two factors that have 
a dominant influence over this issue are: a) the vari- 
ation of the degree of exploitable concurrency and b) 
the variation of computational load, across the set of 
component grids. Both of these factors are primarily 
influenced by: a) the type of mathematical model, b) 
the nature of the computational algorithms, and c) 
the number of grid points, used within each compo- 
nent grid. The secondary factors are: a) the nature of 
the physical/numerical boundary conditions applied, 
and b) the number of IGBP’s and donor cells asso- 
ciated with each grid. In most realistic overset grid 
problems, there is a significant variation of both the 
computational load and the available degree of con- 
currency among the participating component grids. 
In some cases, this variation could be as much as an 
order of magnitude or more. 

When a fixed number of processors are used to 
solve an overset grid problem with such a heteroge- 
neous character by processing each component grid in 
a prescribed sequence, two adverse implications arise. 
First, in the case of component grids possessing only 
a reduced degree of concurrency or smaller compu- 
tational loads, it may not be possible to gainfully 
utilize all the allocated processors for performing the 
underlying computational tasks. This would lead to 
idling of some of the assigned processors. Even when 
the computational attributes of the component grid 
are such that all processors can be gainfully utilized, 
grids with smaller computational loads would incur 
higher parallel implementation overheads due to re- 
duced task granularity. This would invariably lead 
to lower parallel efficiency. In addition, one may also 
be compelled to search for alternative algorithms with 
higher degree of extractable concurrency that have the 


potential for being accompanied by higher memory 
and/or arithmetic overhead as well. 

On the contrary, the concurrent computation of ei- 
ther ail or a subset of the participating grids on dis- 
tinct clusters of processors, where the number of pro- 
cessors assigned to each component grid from the fixed 
pool of processors is decided on the basis of their com- 
putational loads and the inherent degree of exploitable 
parallelism, would likely result in an implementation 
with minimal idling of processors and higher overall 
parallel efficiency. This is due to the fact that each 
individual grid would now be computed using only a 
fraction of the total available processors, which would 
invariably lead to higher parallel efficiency compared 
with the case of processing the same grid on all of 
the available processors. Also, given the smaller num- 
ber of processors assigned to individual grids, this ap- 
proach requires algorithms possessing only a moderate 
degree of exploitable parallelism. 

The secondary factors influencing this choice are; 
1) memory requirements for each component grid vs. 
that available on a fixed number of processors, 2) 
I/O performance to and from secondary storage de- 
vices relative to the sustained computational perfor- 
mance and 3) availability of system software to per- 
form processor-to-processor communication between 
two processors who are members of two different 
groups of processors. 

A careful consideration of all these factors resulted 
in our decision to implement the variant of the overset 
grid approach given by Eqn. (3), at this time. This 
non-iterative time integration scheme was adopted 
due to the concurrency it affords across all the par- 
ticipating component grids. In this context, the pro- 
cess of achieving a good load balance across all the 
processors in the pool assigned to the task is some- 
what complicated. Among the factors that hinder our 
ability are: a) upper bound on the memory available 
per processor, b) limitations imposed by system soft- 
ware on the number of active processes per processor 
(currently limited to one), c) each cluster should con- 
sist of an integer number of processors, d) additional 
constraints on the allowable number of processors per 
cluster that may be imposed by the system architec- 
ture and e) restrictions imposed by the mesh parti- 
tioning scheme used within a component grid to main- 
tain an acceptable level of parallel efficiency within 
that component grid. Consequently, the expectation 
to achieve near-optimal load distribution across the 
entire pool of processors is unrealistic. 

Within the constraints cited above, we adapt the 
following approach. The pool of processors is assumed 
to be partitioned into M clusters, where M is the to- 
tal number of component grids involved. Each such 
duster has Pj,(: = 1,2, processors and one 
component grid assigned to it. As a result, we re- 
quire a minimum of M processors be assigned to the 
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problem. The actual number required may be higher, 
due to the limitations on memory available per pro- 
cessor. The total memory available within a cluster 
should be either equal to or greater than that required 
by it’s component grid solver and the intergrid com- 
munication data structures. If the total number of 
processors in the pool, say P, is sufficiently large, the 
solution to the following critical path problem can be 
used to determine the distribution of processors across 
the M clusters; 

minimize [max T,- : 1 < i < M] subject to 

M 

£p, = P:P,e J 

>=1 

PiNi < pi Pi 

where 

Here we have assumed that the cost of inter- 
grid communication is negligible. Also, , ( i = 
1,2, is the number of grid points in the i-th 

component grid; a;,(z = 1,2,...,M) is the normal- 
ized work load per grid point per step in the i-th com- 
ponent grid; = 1,2, is the parallel effi- 

ciency of the i-th component grid; /?,*, (i = 1,2,..., M) 
is the memory required in words per grid point for the 
flow solver used in i-th component grid and pi , (i = 
1,2,..., M) is the memory in words available per pro- 
cessor in the i-th cluster. In general, a,- is a function of 
the mathematical model and the numerical algorithms 
used. For a given type of flow solver, ?;,• is generally 
a implementation dependent non-monotonic function 
of P{ for Axed A r , Jn addition, it may also depend on 
the type of boundary conditions used and the ratio of 
grid dimensions. The critical path problem is some- 
what more complicated when more than one process 
can be assigned to a processor. 

It is also possible to use a multi-level approach to 
load balancing. First, groups of component grids are 
formed such that the total computational load with- 
ing each group is approximately equal. Then pro- 
cessor clusters are assigned to each group along the 
approach described above. This is followed by forma- 
tion of sub-clusters within each cluster of processors, 
again following the same approach. 

In this stud} r , we have not attempted to solve the 
above critical path problem, but instead have sought 
only to obtain a leading order approximation through 
the following: 

M 

Pi = aiNitP/YpiNiY, i = 1,2, . . . ,M - 1. 

1=1 

M — 1 

p si = P - E P ' 

i = 1 


Note that this involves assuming 7?,- = l,(i = 

1,2, Once the cluster sizes are known for a 
given P, it is possible to use a greedy algorithm to ad- 
just the values of P,, (t = 1, 2, . . . , M) as P changes. 

At the beginning of each new time step, the time 
marching process starts by simultaneously interpolat- 
ing and exchanging the temporally extrapolated field 
data necessary for updating the values at the IGBPs 
of all component grids. During this data exchange, 
a subset of processors within each of the M group of 
processors are participating in inter-processor commu- 
nications. This is then followed by the simultaneous 
and independent computation of the updated values 
of the flow fields in all the participating component 
grids. 

The data parallel, Single Program Multiple Data 
(SPMD) implementations of the component grid flow 
solvers can be carried out independently of one an- 
other. This is a direct consequence of the software 
modularity afforded by the overset grid approach. 
Due to the MIMD nature of the architecture, each 
cluster is capable of executing the same SPMD imple- 
mentation of the solver for different component grids. 
The diagonalized form of the Beam-Warming algo- 
rithm found in OVERFLOW 13 formed the basis for 
the data parallel implementation of the flow solver 
used in this study. The details of this DM-MIMD im- 
plementation can be found in Ref. (14). The version 
used in this study is based on uni-partitioning of the 
grid and uses grouped, two-way pipelined Gausssian 
elimination for the solution of non-periodic pentadiag- 
onal systems. The periodic pentadiagonal systems are 
solved using fully balanced, one-way pipelined Gaus- 
sian elimination algorithm 15 . 

The only task that requires close interaction and co- 
ordination among different clusters of processors from 
the software implementation point of view is that as- 
sociated with the tri-linear interpolation and exchange 
of the flow field data at the IGBPs. This data inter- 
polation and exchange has to be carried out in the 
context of grid partitioning dictated by the indepen- 
dent, data parallel implementations of the component 
grid flow solvers within the clusters assigned to them. 
In addition, this phase of the computation should ex- 
ploit as much concurrency as possible with minimum 
of synchronization barriers to maintain the overall ef- 
ficiency of the parallel implementation. This was ac- 
complished through the use of a distributed, concur- 
rent implementation of the interpolation algorithms 
and a loosely synchronous approach to interprocessor 
data communication involving a highly irregular com- 
munication pattern. This intergrid boundary data 
exchange process required the design of a new dis- 
tributed data structure for the processing of IGBPs 
and their associated donor cells. Also a procedure for 
initializing the highly irregular interprocessor commu- 
nication pattern among processors belonging to differ- 



ent groups was required. Further details with regard 
to the distributed data structures used and the pro- 
cedure followed for establishing the inter-group com- 
munication pattern will be available in a future pub- 
lication. 


The Test bed Architectures 

Here we provide a brief description of the two DM- 
MIMD test bed architectures used in this study: the 
160 node IBM SP2 and the 208 node Intel Paragon. 
The SP2 is essentially a set of IBM RS6000/590 work- 
stations connected by a high performance switch with 
a topology of an omega network (a hierarchy of cross 
bar switches). The RS6000/590 workstation is based 
on 66.7 MHz. POWER2 multi-chip RISC processor, 
with a theoretical peak performance of approximately 
250 Mflops on 64- bit data. Each node has a 128 Kbyte 
data cache and a minimum of 128 Mbytes of main 
memory and 2 Gbytes of disk space. ^From the appli- 
cation software perspective, the interconnection net- 
work is capable of moving data between SP2 nodes 
with a latency of approximately 45 microseconds and 
a bandwidth of about 34 Mbytes/sec. 

The Intel Paragon is composed of 208 compute 
nodes, each consisting of two 50 MHz. i8C0/XP RISC 
microprocessor chips connected by a two-dimensional 
mesh interconnection network. The theoretical peak 
performance of the i860/XP is 75 Mflops on 64-bit 
data, with a 16 Kbyte instruction and data caches. 
Each compute node has 32 Mbytes of memory with 
approximately 22 Mbytes available to user applica- 
tion programs. One of the i860/XP chips on each 
node is used solely for inter-processor communication. 
The interconnection network moves data with a la- 
tency of 120 microseconds and a bandwidth of about 
35 Mbytes/sec. 

Both test beds currently support message passing 
programming paradigm. The implementation in this 
study was layered on the message passing libraries 
based on the Message-Passing Interface (MPI) stan- 
dard 16 . The MPI provides a common interface for 
development of portable message passing applications 
on distributed memory concurrent computers and net- 
works of workstations. The MPI functionality in- 
cludes point-to-point and collective communication 
routines as well as support - for process groups and 
communication contexts. The latter two features are 
essential for the development of modular applications 
which incorporate simultaneous use of data and task 
parallelism. 


The Test Problems 

In order to evaluate the computational performance 
of the DM-MIMD implementation of the overset grid 
flow solver, two realistic test problems that typify the 
aerodynamic analysis computations carried out using 
the overset grid approach were used. The first prob- 
lem, referred to as the Powered-Lift configuration sim- 
ulates the viscous flow past a delta wing with two jet's 
in ground effect. The simulation of viscous flow over a 
Fighter Lift and Control (FLAC) wing with deflected 
leading and trailing edge flaps was used as the second 
problem. This is referred to as the High-Lift configu- 
ration in the subsequent discussion. In order to assess 
the impact of the block Jacobi like variant of the par- 
titioned analysis approach on the unsteady flow com- 
putations, two test problems were considered. The 
first one is the viscous flow past a circular cylinder 
while the Powered-Lift configuration was used as the 
other. 


The Powered-Lift Configuration 

The computational setup of this configuration 1 1 
consists of a 60° delta planform in a free stream of 
Mach number 0.064, at 6.4° angle of attack (a), with 
two choked jets located at the inboard trailing edge. 
The jet flow is at a nozzle pressure ratio (NPR) of 1.8 
and is exhausted at an angle of 45° to the chord of the 
delta planform. The delta wing is located at a height h 
above the ground plane, such that h/b — 0.25, where 
b is the wing span. The flow field symmetry about 
the y = 0 plane passing through the center line of the 
delta wing is assumed. This geometry is discretized 
by generating 1) a C-H grid around the delta wing, 2) 
a cylindrical grid around the jet pipe, 3) a jet trajec- 
tory conforming grid, and embedding the three grids 
in 4) a Cartesian ground plane grid (Figs. 1, 2, 3). 
This results in a composite grid of nearly 1 million 
grid points. The interconnectivity among the four 
grids and the hole points created as a result of over- 
laying is determined using DCF3D software 4 . The 
composite grid was found to contain a total of approx- 
imately 40,000 IGBP’s distributed among it’s compo- 
nents. On the delta wing and pipe surface, no slip 
condition is used along with extrapolation of density 
and pressure values from those at one grid point above 
the solid walls. On the ground surface, a moving wall 
condition is used to reflect the experimental condi- 
tions. The free stream conditions are applied on the 
inflow boundary and the three side surfaces of the 
background grid, while extrapolation is used at the 
outflow boundary. At the jet exit plane, the velocity 
and pressure ratio are set to those corresponding to 
the experimental conditions. 
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Fig. 2: The delta wing grid. 



Fig. 3: The pipe and the jet grids. 


The High-Lift Configuration 

The High-Lift configuration is represented by a 
computational model to simulate the flow over the 
FLAC wing with deflected leading and trailing edge 
flaps at a Mach number of 0.18 and a Reynolds num- 
ber of 2.5 million. The gridding strategy was designed 
to facilitate the computation of flow field at different 
flap deflection angles, with a minimum amount of re- 
gridding. This component grid strategy is depicted in 
Fig. 4. Grids terminate at boundaries between fixed 
and moving parts, viz. flaps. The flaps are gridded as 
separate components, so that the flap rotation about 
a hinge line on the lower surface of the wing can be 
accomplished by rotating the flap grid. As the flaps 
rotate, they slide down the upper surface of the wing. 
The flap tips and the internal wing tips that get ex- 
posed when the flaps deflect need to be discretized in 
a manner that can resolve viscous effects (Fig. 5). Due 
to the presence of airfoil sections with extremely sharp 
leading and trailing edges, these tips and the wing tip 
(Fig. 6) can not be discretized using standard wrap- 
around grids. However, the} f lend themselves easily 
to the use of polar grids placed on the tips, with the 
singularit} r located at the leading or the trailing edge 
itself (Fig. 7). Volume grids are then grown from these 
polar grids to cover the air gaps. The extremely thin 
and sharp wing tip is discretized using three grids; 
two polar grids for the leading and trailing edge ar- 
eas and one cartesian grid for the region not covered 
by the polar grids (Fig. 6). The entire wing-flap sys- 
tem is gridded using 18 grids. These 18 component 



Fig. 4: The FLAC wing grids. 


grids are embedded in a fine global grid covering the 
entire set to facilitate good inter-grid information ex- 
change. The fine global grid is in turn embedded in 
a coarse grid spanning large extent of the physical 
space around the wing, resulting in a composite of 20 
overset grids with a total of approximately 1.5 million 
grid points. Again the interconnectivity among the 
IS FLAC grids and the 2 global grids as w T ell as the 
location of their respective hole points are determined 
using DCF3D. The composite grid was found to con- 
tain 140,000 IGBPs. On all the FLAC wing surfaces, 
a no-slip condition similar to one used in Powered- 


9 







Fig. 5: The FLAC wing exposed tips. 


Lift configuration is applied. A no-slip wall condition 
is used on the yz-plane at the wing root to simulate 
the wind-tunnel wall found in the experimental set 
up. The free stream conditions are applied at the in- 
flow boundary and on the remaining side surface of 
the global coarse grid, while extrapolation is used at 
it 7 s outflow boundary. 

Results and Discussion 

In this section, we describe some computational per- 
formance data and unsteady flow computation re- 
sults obtained using the static overset grid flow solver 
implemented on DM-MIMD computers. All perfor- 
mance data reported are for 64-bit arithmetic and im- 
plementations based entirely on FORTRAN. In imple- 
menting a general purpose code such as the one used 
here, issues such as software modularity, extensibility 
and maintainability cannot be entirely overlooked in 
favor of computational performance. The code exten- 
sibility and maintainability issues precludes the de- 
velopers from using excessive amount of “creative” 
programming procedures and instead rely mostly on 
optimizing compilers for achieving good performance 
on modern RISC architectures. The general purpose 
nature of the code and the attendant software modu- 
larity requirements are often in conflict with program- 
ming techniques that enhance temporal and spatial 
locality of data. The locality of data is of utmost im- 
portance to achieving high performance on hierarchi- 
cal memory architectures such as those found in mod- 
ern RISC processors. We have employed a balanced 
approach, whereby the essential software modularity 
was retained while attempting to maximize data local- 
ity within that context. Another issue confronting the 
application software developers on RISC architectures 



Fig. 6: The FLAC wing tip. 



Fig. 7: The polar tip grids. 
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is the relatively high cost of floating point divide op- 
erations and intrinsic functions such as SQRT, when 
compared with the cost of similar operations on tra- 
ditional vector supercomputers. When implementing 
an algorithm such as ARC3D 7 , it is always possible to 
reduce the total floating point operation count in gen- 
eral and the number of floating point divide and SQRT 
operations in particular, by resorting to the storage of 
intermediate data arrays. Very often, this is accom- 
panied by increased memory requirements, measured 
in terms of 64-bit words per grid point. This in turn 
reduces the size of the problem that can be computed 
on a fixed number of processors. Therefore judicious 
compromises are necessary when employing memory 
vs. time optimizations. The degree to which all of the 
above tradeoffs are exercised have a profound impact 
on the observed performance of the implementation. 

Fig. 8 shows the performance of the single grid, 
implicit Navier-Stokes solver on the IBM SP2, for 
5 different problem sizes and processor counts vary- 
ing from 1 to 128. The problem sizes vary between 
approximately 1/4 million to 4 million grid points. 
The performance is measured in terms of time per 
step. The corresponding figure for a single processor 
of Cray C-90 is approximately 7 microsec/pt/step. Al- 
though for a fixed grid size, the time/step continues 
to decrease as the number of processors is increased, 
the parallel efficiency also decreases. For the small- 
est problem size used, the efficiency is only 40% when 
128 processors are deployed. This is primarily due 
to the inevitable increase in parallel implementation 
overheads as a fraction of the total time, when the 
number of processors is increased for a fixed problem 
size. A discussion of various types of parallel imple- 
mentation overheads and their impact on performance 
can be found in Ref. (18). 



Tables (1) and (2) show the total number of pro- 
cessors used and the sizes of the processor clusters as- 
signed to each component grid for a series of runs on 
the IBM SP2, for the Powered-Lift and the High-Lift 


configurations respectively. The tables also show the 
sizes of each component grid and the mesh partition- 
ing used for the data parallel implementation of the 
implicit flow solver within each component grid. This 
mesh partitioning was chosen to provide best possible 
performance of the flow solver, given the size of the 
processor cluster assigned to a particular component 
grid. For the implicit flow solver implementation used 
in this study, it was found that for a fixed number of 
processors, the best performance was generally real- 
ized when following guidelines were adhered to during 
the mesh partitioning process: a) each coordinate di- 
rection is partitioned into two segments before any 
direction is assigned more than two partitions, b) ra- 
tio of partitions in each coordinate direction matches 
as closely as possible to the ratio of grid points in each 
coordinate direction. For a given number of proces- 
sors, it is not always possible to follow both guidelines 
exactly, but the first guideline always takes precedence 
over the second whenever possible. 

As a result of this mesh partitioning strategy, no 
attempt was made to equi-distribute the IGBPs or 
donor cells among all participating processors. Ta- 
bles (3-6) show the distribution of IGBPs and donor 
cells among processors assigned to different grids of 
the Powered-Lift configuration, where the number of 
processors used is 28. As can be seen from the tables, 
the distribution of IGBPs and donor cells among pro- 
cessors is highly non-uniform. It is quite evident from 
the tables that a given processor can act in one of the 
following four modes during intergrid communication 
process: a) only as a recepient of donor cell data from 
other processors, b) only as a provider of donor cell 
data to other processors, c) a combination of (a) and 
(b), d) none of the above. As a result, the processor 
load during intergrid communication process can be 
highly unbalanced. 

Consider the case when processors are acting in 
mode (a). The number of grids and processors supply- 
ing donor cell data can vary widely among the active 
set of processors. In addition, the number of donor 
cells supplied by each donor processor can also have a 
large variation. Now consider the case when proces- 
sors are acting in mode (b). Again the number of grids 
and associated processors receiving donor cell data 
from a donor processor can vary widely across the ac- 
tive set of processors. Also, the number of donor cells 
supplied to each of the recepient processors can also 
be very different. As a consequence, during the inter- 
grid communication phase, the number and length of 
messages received as well as the messages sent by a 
processor can have a large variation across the active 
set. In addition, the sources of the incoming mes- 
sages and the destinations of the outgoing messages is 
widely dispersed across the entire active processor set. 
This results in a highly irregular interprocessor com- 
munication pattern with a wide variation in processor 
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load. 

As can be seen from Table (1), Powered-Lift con- 
figuration is an example of an overset grid problem 
with: a) a relatively small number of component grids 
and b) the sizes of the largest and smallest grids dif- 
fer only by a factor of two. As a result, it is pos- 
sible to realize reasonably good static load balance 
among component grid flow solvers, even with only 
a moderately sized pool of processors. On the oth- 
erhand, the High-Lift configuration is an example of 
an overset grid problem with: a) a moderate num- 
ber of component grids and b) the sizes of the largest 
and the smallest grids differ by a factor as much as 
15. Consequently, it is quite difficult to realize good 
static load balance among the component grid flow 
solvers without a relatively large pool of processors. 
Not all entries in Tables (1) and (2) represent cases 
where the component grid solver loads are in balance 
across clusters of processors. The first entry in both 
tables represent the smallest pool of processors that 
can be used to solve the problems. It is 6 and 22 for 
the Powered-Lift and High-Lift configurations respec- 
tively. Due to the wide disparity in the grid sizes, the 
High-Lift configuration needs a minimum of 105 pro- 
cessors to achieve a good static load equi-distribution 
across all component grids. Some of the other en- 
tries in the tables represent assignment of processors 
to the clusters based on a greedy algorithm, i.e., the 
cluster with the heaviest load at a given time getting 
the most number of additional processors as the size 
of the pool assigned to the problem is increased. Al- 
though this approach does not guarantee a good load 
balance across the entire pool of processors, it ensures 
that the additional processors are put to best possible 
use. 

Fig. 9 shows the performance of the overset grid flow 
solver on the IBM SP2 for Powered-Lift and High-Lift 
configurations, as the size of the pool of processors as- 
signed to the problems is increased. Similar data for 



Fig. 9: Performance on the SP2 for Powered- and 

High-Lift configurations. 


the Intel Paragon is shown in Fig. 10. Although most 
of the cases depicted in these figures are not anywhere 
near a balanced load state, the data indicates contin- 
ued reduction in time required to complete the task 
as the number of processors assigned to the task is 
increased. However, this does not imply that all the 
processors are optimally utilized. In all cases, the time 
per step is determined by the slowest component grid. 
Load imbalances result in idling of processors belong- 
ing to other component grids to varying degrees. 



Paragon Nodes 

Fig. 10: Performance on the Paragon for Powered- and 
High-Lift configurations. 


Table (7) shows the time per step required for 
the intergrid data communication process for the 
Powered-Lift configuration. The size and distribution 
of the pool of processors used is same as those in Ta- 
ble (1). This time was measured by introducing a 
synchronizing barrier across the entire pool of proces- 
sors before the intergrid communication process was 
initiated to eliminate any processor idle time being in- 
cluded under the intergrid communication costs. The 
data indicates that the cost of intergrid communica- 
tion itself is always less than 4% of the total time 
per step, irrespective of the size of the processor pool. 
Similar results are obtained for the High-Lift config- 
uration, even though the number of component grids 
is 20. This validates our assumption that the cost of 
intergrid communication is negligible in spite of the 
highly non-uniform load distribution encountered dur- 
ing this phase of the computation. It also attests to 
the efficacy of the approach used for accomplishing the 
intergrid communications. Table (7) also indicates the 
maximum idle time for any processor in the pool of 
processors assigned to the problem. This provides an 
indication of the worst possible load imbalance that 
exists across the entire set of active processors. 

In this section, we discuss the results of the prelim- 
inary experiments carried out to investigate the effect 
of block-Jacobi like variant of partitioned analysis ap- 
proach implemented in this study, on the computation 
of unsteady flow fields. Fig. 11 shows the time trace 
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of the lift force acting on a circular cylinder obtained 
using the DM-MIMD version of the overset grid solver 
and the Cray C-90 implementation of a similar solver 
13 . Three overset grids were used for this computa- 
tion. The amplitude and frequency of the lift his- 
tories show good agreement, indicating that: a) the 
non-dimensional time step size of 0.125 used for this 
computation and b) locations of the component grid 
overlap regions, are such that the use of block-Jacobi 
variant does not lead to any deleterious effects on the 
solution. It should be noted that this is a problem 
with only one dominant, relatively low frequency com- 
ponent in the solution and therefore does not pose se- 
rious challenges to the partitioned analysis approach. 



Next, Fig. 12 compares the lift histories of the 
delta wing in the Powered-Lift configuration, obtained 
through use of above two implementations. Initially 
they show good agreement, but as the flow fields 
develop, some discrepancies between the two traces 
begin to emerge. Our speculation is that once the 
jet impinges the ground plane, it is likely that pres- 
sure waves are generated that bounce back and forth 
between the underside of the delta wing and the 
ground plane, resulting in highly non-linear interac- 
tions. Such interactions appear to produce compo- 
nents with frequencies high enough to produce differ- 
ences in the solutions obtained through the two ap- 
proaches. In order to examine what effect the time 
step size would have on the solution, we repeated the 
experiment twice, each time reducing the time step 
size by a factor of two. The lift histories obtained are 
shown in Figs. 13, 14. Both approaches show some 
differences and they do not follow similar trends. Al- 
though no firm conclusions can be drawn from this 
preliminary investigation, it indicates that the use of 
block-Jacobi approach can lead to discrepancies at 
least in some unsteady flow computations. 
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Conclusions 


References 


We have successfully implemented an implicit overset 
grid flow solver on DM-MIMD architectures based on 
partitioned analysis approach. The implementation 
facilitates the simultaneous exploitation of data paral- 
lelism available within each component grid and task 
parallelism available across the composite of overset 
grids. This has the potential to enhance the scala- 
bility of the implementation, especially for problems 
with widely disparate component grid sizes. As a 
result of using the MPI standard, the implementa- 
tion was shown to be completely portable across two 
DM-MIMD architectures, the IBM SP2 and the In- 
tel Paragon. The software architecture adapted for 
the implementation allows complete modularity and 
the possibility of deploying different flow solvers on 
different component grids, if necessary. 

Current restrictions imposed by the system software 
prevents the assignment of more than one process to 
a processor for time shared execution. This is seen 
as a major hindrance to accomplishing the following: 
a) a good static load balance across the component 
grids and b) solution of overset grid problems with a 
large number of disparately sized component grids on 
a moderate number of processors. In spite of this diffi- 
culty, we have been able to demonstrate reductions in 
total time per time step on two realistic overset grid 
Navier-Stokes computations with the increasing size 
of the pool of processors assigned to the problems. 
The cost of intergrid communications appears to be 
negligible for the two test configurations used. The 
failure to realize good static load balance with certain 
processor counts leads to significant idling of some of 
the processors assigned to the task. This along with 
parallel efficiency losses within each component grid 
flow solver are the major factors limiting the parallel 
efficiency of the overset grid flow solver. 

For simulations involving unsteady flow computa- 
tions, further studies are needed to gain a better un- 
derstanding of the impact of using block-Jacobi like 
variant of the partitioned analysis approach. Future 
efforts may also be warranted in developing static load 
balancing software to determine the optimum number 
and sizes of processor clusters and the assignment of 
component grids to the clusters, given the size of the 
pool of processors available to the task. The oppor- 
tunity also exists, when only a steady state solution 
of the overset grid problem is required, to explore the 
possibility of allowing component grid solvers to, pro- 
ceed through the time marching process in a com- 
pletely asynchronous manner. Although this may al- 
leviate some load balancing problems, potential exist 
for the appearance of numerical stability problems. 
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Procs. # 


Grid 2 (83,81,47) 

Grid 3 (60,71,52) 

Grid 4 (69,71,35) 

6 

(2,1,1) = 2 

(2,1,1) = 2 

(1,1,1) = 1 

(i,i,i) = i 

7 

(1,1,2) = 2 

(1,2,1) = 2 

(1,2,1) = 1 

(i,i,i) = i 

12 

(2,1,2) = 4 

(2,2,1) = 4 

(1,2,1) = 2 

(1,2,1) = 2 

26 

(2,2.2) = 8 

(2,2,2) = 8 

(2,3,1) = 6 

(2,2,1) = 4 

28 

(2,2,2) = 8 

(2,2,2) = 8 

(2,2,2) = 8 

(2,2,1) = 4 

52 

(2,2,4) = 16 

(2,4,2) = 16 

(2,3,2) = 12 

(2,2,2) = 8 

■■ 

(2,2,7) = 28 

(4,4,2) = 32 

(3,4,2) = 24 

(2,4,2) = 16 

— 

(3,2,5) = 30 

(4,3,3) = 36 

(3,4,2) = 24 

(3,3,2) = 18 

122 

(4,2,4) = 32 

(5,4,2) = 40 

CO 

P 1 

1o 

11 

CO 

o 

(2,5,2) = 20 


Table 1: Grid partitioning for the Powered-Lift configuration on the SP2. 


Grid # 

Grid size 

Case 1 

Case 2 

Case 3 

Case 4 

Case 5 

Case 6 

1 

(62,62,62) 

(1.1,2) = 2 

(1-2,2) = 4 

(2,2.2) = 8 

(2,2,4) = 16 

(2,2,4) =16 

(2,2,4) = 16 

2 

(62,62,62) 

(1,1,2) = 2 

(1,2.2) = 4 

(2,2,2) = S 

(2,2,4) = 16 

(2,2,4) =16 


3 

(99,38,30) 

(1,1,1) = 1 

(1,1,2) = 2 

(2,1,2) = 4 

(2,2,2) = 8 

(2,2.2) = 8 

(2.2,2) = 8 

4 

(49,75,30) 

(1,U) = 1 

(1,1,2) = 2 

(2,3,1) = 6 

(2,2,2) = 8 

oo 

11 

CN 

w 

(2.2,2) = 8 

5 

(99,38,30) 

(1,1,1) = 1 

(1,1.2) = 2 

(2,2,1) = 4 

(2,2,2) = 8 


(2,2,2) = 8 

6 

(49,57,31) 

(1,U) = 1 

(1,1,2) = 2 

(2,2,1) = 4 

(2,2,1) = 4 

(2,3,1) = 6 

(2,2,2) = 8 

7 

(79.49,33) 

(U.l) = 1 

(1,2,2) = 4 

(3,2,1) = 6 

(2,2,2) = 8 

(2,2,2) = 8 

(4,2,2) = 16 

8 

(36,68,40) 

(1,1,1) = 1 

(1.1,2) = 2 

(1,2,1) = 2 

(2,2,2) = 8 

(1,2,3) = 6 

(2,2,2) = 8 

9 

(36,57,30) 

(1,1,1) = 1 

(1,1,1) = 1 

(1,3,1) = 3 

(1,2,1) = 2 

(1,2,2) = 4 

(2,2,1) = 4 

10 

(36,68,30) 

(1,1,1) = 1 

(1,2,1) = 2 

(1,2,1) = 2 

(2,2,1) = 4 

(1,3,2) = 6 

(2,2,2) = 8 

11 

(26,57,30) 

(1,1,1) = 1 

(1,1,1) = 1 

(1,1,1) = 1 

(1,2,1) = 1 

(1,2,2) = 4 

(1,2,2) = 4 

12 

(10.32,50) 

(1,1,1) = 1 

(1.1,1)= 1 

(1,1,1) = 1 

(1,1,1)= 1 

(U,l) = 1 

(1,1,2) = 2 

13 

(14,32,50) 

(1,1,1) = 1 

(U,D = 1 

(1,U) = 1 

(1,1,1)= 1 

(1,1,1) = 1 

(1,1,2) = 2 

14 

(11,32,50) 

(1.1.1) = 1 

(l.l.D = 1 

(1,1,1) = 1 

(l.U) = 1 

(1,1,1) = 1 

(1,1,2) = 2 

15 

(24,55.20) 

(1,1,1) = 1 

(1-1,1) = 1 

(1,1,1) = 1 

(1,U) = 1 

(1,2,1) = 2 

(1,2,1) = 2 

16 

(24,55,20) 

1 (1,1,1) = 1 

! (i,U) = i 

(1,1,1) = 1 

(U,l) = 1 

(1,2,1) = 2 

(1,2,1) = 2 

17 

(24,55.20) 

(1.1.1) = 1 

(U,i) = i 

(1,1,1) = 1 

(1,1,1) = 1 

(1,2,1) = 2 

(1,2.1) = 2 

IS 

(24.55,20) 

(1.1.1) = 1 

d.1,1) = i 

(1,1,1) = 1 

(1,1,1) = 1 

(1,2,1) = 2 

(1,2,1) = 2 

19 

(24,55,20) 

(1.1,1) = 1 

(i,i,i) = i 

(1,1,1) = 1 

(1,1,1) = 1 

(1,2,1) = 2 

■IWtilBwi 

20 

(24,55,20) 

(U.l) = 1 

(i,i,i) = i 

(1,1,1) = 1 

(1,1,1) = 1 

(1,2,1) = 2 



Total 

22 

35 

62 

93 

105 

122 


Table 2: Grid partitioning for the high-lift configuration on the Paragon. 














































ZONE 1 


Proc. ID. 

1 

2 

3 

4 

5 

6 

7 

8 

IGBP 

271 

360 

11 

91 

405 

379 

21 

98 

Donor Zones 

1 

3 

1 

1 

1 

2 

1 

1 

Donor Procs 

1 

10 

1 

2 

3 

8 

2 

5 

Max D.C. /D.P. 

271 

286 

ii 

88 

330 

259 

14 

78 

Min D.C. /D.P. 

271 

1 

11 

3 1 

2 

1 

7 

1 

Avg D.C. /D.P. 

271 

36 

11 

46 

135 

48 

11 

20 

IGBP served 

1592 

8632 

267 

2031 

4730 

1524 

725 

869 

Zones served 

2 

3 

1 

1 

1 

2 

1 

1 

Procs served 

4 

12 

2 

4 

4 

4 

4 

1 

Max IGBP/Proc 

1377 

1286 

233 

965 

2088 

1211 

487 

869 

Min IGBP/Proc 

2 

2 

34 

226 

746 

25 

5 

869 

Avg IGBP/Proc 

398 

720 

134 

508 

1183 

381 

182 

869 


Table 3: Powered-Lift Zone 1 grid-communication details (D.C. — Donor Cells, D.P. Donor Processors). 


ZONE 2 ! 

Proc. ID. 

1 

2 

3 

4 

5 

6 

7 

8 

IGBP 1 

960 

1272 

936 

1274 

2600 

2818 

2559 ] 

2837 

Donor Zones 

1 

2 

1 

2 

l\ 

2 

1 

3 

Donor Procs 

1 

6 

1 

4 

3 1 

9 

4 

11 

Max D.C. /D.P. 

960 

491 

936 

528 

2088 

1211 

1377 | 

1286 

Min D.C. /D.P. 

960 

24 

936 

164 

25 

30 

2031 

2 

Avg D.C. /D.P. 

960 

212 

936 

319 

867 

314 

640 

258 

IGBP served 

2 

778 

0 

1223 

394 

2596 

424 

4853 

Zones served 

1 

3 

0 

3 

1 

3 

2 

3 

Procs served 

1 

8 

0 

9 

4 

8 

9 

11 

Max IGBP/Proc 

2 

248 

0 

348 

330 

680 

271 

868 

Min IGBP/Proc 

2 

3 

0 

1 3 

6 

78 

1 

6 

Avg IGBP/Proc 

2 

98 

o 

136 

99 

| 325 | 48 | 44v 


Table 4: Powered-Lift Zone 2 grid-communication details. 


ZONE 3 

Proc. ID. 

1 

2 

3 

4 

5 

6 

7 

8 

IGBP 

935 

288 

909 

272 

1377 

2225 

12461 

2100 

Donor Zones 

1 

1 

1 

1 

2 

3 

2 

3 

Donor Procs 

1 

1 

1 

1 

4 

6 

4 

6 

Max D.C. /D.P. 

935 

288 

909 

272 

970 

903 

872 

868 

Min D.C. /D.P. 

935 

288 

909 

272 

15 

6 

16 

2 

Avg D.C. /D.P. 

935 

288 

909 

272 

345 

371 1 

312 

350 

IGBP served 

2 

720 

1 

680 

53 

1027^ 

74 

926 

Zones served 

1 

1 

1 

1 

2 

2 

2 

2 

Procs served 

1 

1 

1 

1 

2 

2 

2 

2 

Max IGBP/Proc 

2 

720 

1 

680 

44 

913 

65 

820 

Min IGBP/Proc 

2 

720 

1 

680 

9 

114 

9 

106 

Avg IGBP/Proc 

2 

720 

1 

680 

27 

514 

37 

463 


Table 5: Powered-Lift Zone 3 grid-communication details. 
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ZONE 4 

Proc. ID. 

i 

2 

3 

4 

IGBP 

3448 

2448 

3219 

2330 

Donor Zones 

2 

2 

2 

2 

Donor Procs 

6 

6 

6 

6 

Max D.C./D.P. 

913 

1286 

820 

1188 

Min D.C./D.P. 

225 

28 

248 

32 

Avg D.C./D.P. 

575 

408 

537 

389 

IGBP served 

1667 

164 



Zones served 

3 

2 

3 


Procs served 

7 

4 

7 

4 

Max IGBP/Proc 

903 

104 

813 

143 

Min IGBP/Proc 

2 

11 

1 

9 

Avg IGBP/Proc 

239 

41 

219 

51 


Table 6: Powered-Lift Zone 4 grid-communication details. 


Proc. No. 

6 

7 

12 

26 

52 

100 

108 

122 

Total time/ 
step 

8.899 

6.690 

5.198 

2.358 

1.201 

0.800 

0.761 

0.663 

Min. solver 
petime/step 

5.480 

5.340 

2.770 

1.592 

1.185 

0.738 

0.715 

0.595 

Max. I.G.C. 
time/step 

0.058 

0.055 

0.053 

0.026 

0.027 

0.014 

0.013 

0.020 

Max. Idle 
time/step 

3.600 

1.353 

2.626 

0.766 

0.015 

0.062 

0.046 

0.071 


Table 7: Intergrid communication (I.G.C.) and idle time for the Powered-Lift configuration for various no. of processors 
on the SP2. 
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